Abstract

In this paper we demonstrate that the system of 2-adic numbers provides new insight into some problems of genetics, in particular the degeneracy of the genetic code and the structure of the PAM matrix in bioinformatics. The 2-adic distance is an ultrametric, and applications of ultrametrics in bioinformatics are not surprising. However, by using the 2-adic numbers we match the ultrametric with a number-theoretic structure. In this way we find new applications of an ultrametric which differ from those known up to now in bioinformatics.

We obtain the following results. We show that the PAM matrix $A$ allows an expansion into the sum of two matrices, $A = A^{(2)} + A^{(\infty)}$, where the matrix $A^{(2)}$ is 2-adically regular (i.e., its matrix elements are close to locally constant with respect to the 2-adic parametrization of the genetic code discussed earlier by the authors), and the matrix $A^{(\infty)}$ is sparse. We discuss the structure of the matrix $A^{(\infty)}$ in relation to the side chain properties of the corresponding amino acids.


1 Introduction 


Various clustering procedures play a crucial role in bioinformatics, in particular in genetics; see, e.g., [1], [2] or [5]. An important class of such procedures is based on the introduction of various metrics on the space of information strings; see, e.g., [?]. A metric with new interesting features was recently used in theoretical physics (from string theory to the theory of disordered systems and spin glasses), see, e.g., [?], [7], [5], [?], in cognitive science and psychology, and in image analysis [10]. This is the so-called p-adic metric (in fact, a class of metrics depending on the parameter p, a prime number). The main distinguishing feature of this metric is its sensitivity to hierarchical patterns in information having a special structure that matches the p-adic encoding of information.

A few years ago the 2-adic metric was applied to study the problem of degeneracy of the genetic code; see [11], [12], [15]. These p-adic models can be considered as a new development in the approach to investigating the structure of the genetic code from the point of view of coding theory; see [?], [13], [?].

In the present paper we discuss the structure of the PAM matrix used in bioinformatics (see, for example, [?]) from the point of view of p-adic analysis. We use the 2-adic parametrization of the genetic (amino acid) code obtained in [11] (see also [?] for a different p-adic parametrization).

In [11], [12] it was shown that, after a special parametrization of the space of codons (triples of nucleotides), the genetic code becomes a locally constant map of a p-adic argument. Moreover, the degeneracy of the genetic code in this language takes the form of local constancy of the corresponding map.

Let us also mention the application of the p-adic parametrization to the description of the Parisi matrix from the replica symmetry breaking approach to spin glasses [17], [18]. After the p-adic parametrization of the numbers of the lines and the columns, the Parisi matrix becomes a locally constant block matrix.

It is natural to check, using the p-adic parametrization approach, the structure of the PAM matrix. The PAM matrix is used in bioinformatics for sequence alignment and is constructed using a Markov chain model of point mutations for a protein chain.

We assume that the structure of the PAM matrix has some relation to the structure of the genetic code. Using this idea, we enumerate the lines and the columns of the PAM matrix using the 2-adic parametrization of the genetic code. After this parametrization the PAM matrix becomes more regular: the dependence of the matrix elements $A_{ij}$ of the PAM matrix on the indices $i$ and $j$ is close to locally constant with respect to the 2-adic norm for the majority of matrix elements.

There are some exceptions to this rule. It is easy to see that these exceptions are related to several amino acids, namely Y, W, C, F, L. To describe these deviations from 2-adicity we introduce the following construction: we expand (by hand) the PAM matrix into the sum of two matrices

$$A = A^{(2)} + A^{(\infty)}.$$

The matrix $A^{(2)}$ in this expansion is 2-adically regular (close to locally constant). The matrix $A^{(\infty)}$ is sparse (the majority of its matrix elements are zero; the non-zero matrix elements are mainly concentrated on the lines and columns related to the amino acids Y, W, C, F, L).

One can see that the deviations from 2-adicity (i.e., the non-zero matrix elements of $A^{(\infty)}$) are related to amino acids which are in some sense special: to the aromatic amino acids Y, W, F, and to Cysteine C, which contains the SH group.

We also mention that the 2-adic structure of the genetic code is related to some chemical properties of the amino acids. In particular, the hydrophobic amino acids are clustered in two balls with respect to the 2-adic norm. Therefore the 2-adic parametrization makes it possible to separate the impact of the chemical and the geometrical properties of aromatic amino acids on the structure of the PAM matrix.

The structure of the present paper is the following.

In Section 2 we discuss a family of ultrametric spaces.

In Section 3 we describe the 2-adic 2-dimensional parametrization of the genetic code of [11].

In Section 4 we present the PAM250 matrix.

In Section 5 we describe the reshuffling of the lines and the columns of the PAM matrix corresponding to the 2-adic parametrization of the genetic code of Section 3.

In Section 6 we introduce the expansion of the PAM matrix into the sum of two matrices, one of which is 2-adically regular (close to locally constant) and the other is sparse (the majority of its matrix elements are equal to zero).

Sections 7 and 8 are appendices where the definitions of PAM matrices and of the eucaryotic genetic code are given.

2 Ultrametric spaces 

An ultrametric space is a metric space where the metric $d(x, y)$ satisfies the strong triangle inequality:

$$d(x, y) \le \max\big(d(x, z),\, d(y, z)\big), \qquad \forall x, y, z.$$

The strong triangle inequality can be stated geometrically: each side of a triangle is at most as long as the longer of the other two sides. Such a triangle is quite restricted when considered in ordinary Euclidean space: it is isosceles, i.e., $d(x, y) = d(y, z)$ or $d(x, z) = d(y, z)$ or $d(x, y) = d(z, x)$.
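The strong triangle inequality and the isosceles property can be checked numerically. The following Python sketch is an illustration added here, not part of the original construction: it uses the 2-adic distance between ordinary integers (the function names and the test range 0, ..., 31 are choices of this sketch) and verifies both properties by brute force.

```python
from itertools import product

def dist2(x, y):
    """2-adic distance between two integers: 2**(-v), where 2**v is the
    largest power of 2 dividing x - y (and 0 if x == y)."""
    if x == y:
        return 0.0
    diff = abs(x - y)
    v = (diff & -diff).bit_length() - 1  # number of trailing zero bits of diff
    return 2.0 ** (-v)

def check_ultrametric(points):
    """Return True if every triple satisfies the strong triangle inequality
    and the two longest sides of every triangle are equal (isosceles)."""
    for x, y, z in product(points, repeat=3):
        a, b, c = dist2(x, y), dist2(x, z), dist2(y, z)
        if a > max(b, c):
            return False
        sides = sorted((a, b, c))
        if sides[1] != sides[2]:
            return False
    return True

print(check_ultrametric(range(32)))
```

Powers of two are represented exactly as floating point numbers, so the equality test on side lengths is safe in this setting.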

An ultrametric space is a natural mathematical object for the description of a hierarchical system. On ultrametric spaces there exist many locally constant functions, i.e., functions which are constant on some neighbourhood of any point but not necessarily constant on the whole space. In particular, the genetic code can be considered as a locally constant map on a specially designed ultrametric space, the so-called 2-adic plane; see [11].

Let $(X, d)$ be an ultrametric space. We consider balls $U_r(a) = \{x \in X : d(x, a) \le r\}$, $r > 0$, $a \in X$. So $a$ is the center of the ball $U_r(a)$ of radius $r$. We mention a few properties of ultrametric balls that are unusual from the viewpoint of Euclidean geometry:

a) Each point of $U_r(a)$ can be chosen as its center. So, inside a ball all points have "equal rights".

b) Any two balls either do not intersect or one of the balls contains the other. In this framework, clustering into disjoint balls is a very natural operation.

We remark that ultrametric spaces have been widely used in bioinformatics; see [?]. However, in this paper we plan to elaborate new applications of ultrametric spaces to biology (genetics) which are different from those mentioned.
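Properties a) and b) of ultrametric balls can be observed directly on a small example. The sketch below is illustrative only: it assumes the 2-adic distance on the integers 0, ..., 15 and closed balls $d(x, a) \le r$, and all helper names are hypothetical.

```python
def dist2(x, y):
    """2-adic distance between two integers (0 if they coincide)."""
    if x == y:
        return 0.0
    diff = abs(x - y)
    v = (diff & -diff).bit_length() - 1  # largest v with 2**v dividing x - y
    return 2.0 ** (-v)

def ball(points, center, radius):
    """Closed ball {x : d(x, center) <= radius} inside a finite set of points."""
    return frozenset(p for p in points if dist2(p, center) <= radius)

POINTS = range(16)
RADII = [1.0, 0.5, 0.25, 0.125]

# All distinct balls of the chosen radii.
BALLS = {ball(POINTS, a, r) for a in POINTS for r in RADII}

# Property a): every point of a ball is a center of the same ball.
every_point_is_center = all(
    ball(POINTS, b, r) == ball(POINTS, a, r)
    for a in POINTS for r in RADII for b in ball(POINTS, a, r)
)

# Property b): two balls are either disjoint or one contains the other.
disjoint_or_nested = all(
    not (u & v) or u <= v or v <= u
    for u in BALLS for v in BALLS
)

print(every_point_is_center, disjoint_or_nested, len(BALLS))
```

The balls form a binary tree: one ball of radius 1, two of radius 1/2, four of radius 1/4 and eight of radius 1/8, which is the hierarchical clustering picture used in the rest of the paper.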

We are interested in the following special class of ultrametric spaces $(X, d)$. Every point $x$ is an infinite sequence of digits

$$x = (\alpha_0, \alpha_1, \ldots, \alpha_n, \ldots). \qquad (1)$$

Each digit takes a finite number of values,

$$\alpha_j \in A_m = \{0, \ldots, m - 1\}, \qquad (2)$$

where $m > 1$ is a natural number, the base of the alphabet $A_m$.

If the sequence $x = (\alpha_0, \alpha_1, \ldots, \alpha_n, \ldots)$ contains only a finite number of non-zero terms $(\alpha_0, \alpha_1, \ldots, \alpha_n)$, then we can consider $x$ as the natural number

$$x = \sum_{i=0}^{n} \alpha_i m^i. \qquad (3)$$

Moreover, this formula defines a one-to-one correspondence between natural numbers (with zero) and the space of finite sequences $x = (\alpha_0, \alpha_1, \ldots, \alpha_n)$.

We denote the space of sequences (1), (2) by the symbol $Z_m$. An ultrametric is introduced on this set in the following way. For two points

$$x = (\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_n, \ldots), \qquad y = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n, \ldots),$$

we set

$$d_m(x, y) = \frac{1}{m^k} \quad \text{if } \alpha_j = \beta_j,\ j = 0, 1, \ldots, k - 1, \text{ and } \alpha_k \ne \beta_k.$$

The ultrametric space $(X = Z_m, d = d_m)$ is called the space of m-adic integers.
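As a minimal illustration of this definition (added here, not taken from the original paper), the following Python sketch computes $d_m$ for natural numbers by comparing their base-$m$ digits, lowest digit first. The function names and the truncation to `n` digits are assumptions of the sketch.

```python
def digits(x, m, n):
    """First n base-m digits (alpha_0, ..., alpha_{n-1}) of a natural
    number x, starting from the lowest digit."""
    out = []
    for _ in range(n):
        x, r = divmod(x, m)
        out.append(r)
    return out

def d_m(x, y, m, n=32):
    """m-adic distance m**(-k), where k is the first index at which the
    digit sequences of x and y differ; 0 if the first n digits agree."""
    for k, (a, b) in enumerate(zip(digits(x, m, n), digits(y, m, n))):
        if a != b:
            return float(m) ** (-k)
    return 0.0

print(digits(11, 3, 3))  # 11 = 2 + 0*3 + 1*9
print(d_m(5, 13, 2))     # digits agree up to index 2, differ at index 3
```

Two numbers are close in this metric when they share many low-order digits, which is the opposite of the usual ordering of digits by magnitude.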

The ultrametric $d_m$ describes the following hierarchical structure. If $x = (\alpha_0, \alpha_1, \ldots, \alpha_n, \ldots)$, $\alpha_j = 0, 1, \ldots, m - 1$, is a vector encoding information about some object, then the digits $\alpha_j$ have different weights. The digit $\alpha_0$ is the most important, $\alpha_1$ dominates over $\alpha_2, \ldots, \alpha_n, \ldots$, and so on. Such hierarchical information vectors can be created by living systems, e.g., by the brain to process information. Applications of m-adic numbers to information theory and, in particular, to the description of cognitive processes and complex social systems were developed in [?], [?], [?], [?], [?], [22], [23].

In applications we will use not only "one-dimensional" m-adic spaces, but also Cartesian products of several such spaces, e.g., the m-adic plane $Z_m \times Z_m$, and so on. Our aim is to show that a 2-adic plane structure is embedded in the genetic code; see [?].

We remark that m-adic numbers for $m = p$, where $p$ is a prime number, have been used intensively (during the last 20 years) in mathematical physics [3], [9]. The number-theoretic definition is as follows.

Let us fix a prime number $p$. For example, fix $p = 2$ or $p = 1999$. An example of an ultrametric space is the field of p-adic numbers $Q_p$, which is the completion of the field of rational numbers with respect to the p-adic norm $|x|_p$, defined as follows: for a rational number $x = p^{\gamma} \frac{m}{n}$, where $\gamma = 0, \pm 1, \pm 2, \ldots$, and $m$, $n$ are non-zero integers not divisible by $p$, its p-adic norm is

$$|x|_p = p^{-\gamma}.$$

p-Adic norm is widely used in number theory and algebraic geometry. 
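The definition of the p-adic norm translates directly into a short computation on rational numbers. The following Python sketch is an illustration with hypothetical function names; it uses exact arithmetic from the standard `fractions` module.

```python
from fractions import Fraction

def p_adic_valuation(x, p):
    """Exponent gamma in x = p**gamma * m / n with m, n not divisible by p."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is not a finite integer")
    gamma = 0
    num, den = x.numerator, x.denominator
    while num % p == 0:   # powers of p in the numerator raise gamma
        num //= p
        gamma += 1
    while den % p == 0:   # powers of p in the denominator lower gamma
        den //= p
        gamma -= 1
    return gamma

def p_adic_norm(x, p):
    """p-adic norm |x|_p = p**(-gamma), with |0|_p = 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(1, p) ** p_adic_valuation(x, p)

print(p_adic_norm(12, 2))              # 12 = 2**2 * 3
print(p_adic_norm(Fraction(3, 8), 2))  # 3/8 = 2**(-3) * 3
```

Note that a number highly divisible by $p$ has a small norm, while a denominator divisible by $p$ makes the norm larger than 1.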

If a rational number is divisible by $p^{\gamma}$, where $\gamma$ is very large, then its p-adic norm is very small. This hierarchy of the powers of $p$ (i.e., of the values of the p-adic norm) gives the hierarchical (ultrametric) structure of the p-adic norm. p-Adic numbers are in one-to-one correspondence with the series

$$x = \sum_{i=\gamma}^{\infty} x_i p^i, \qquad x_i = 0, \ldots, p - 1,$$

where $\gamma$ is an integer and $x_{\gamma} \ne 0$ (this expansion is the analogue of (3)).

The unit ball in $Q_p$ with center at zero coincides with the space of p-adic integers $Z_p$.

p-Adic numbers have been actively used (since the pioneering papers of I. Volovich [24]) in high energy physics (superstring theory, cosmology) and in the theory of disordered systems (spin glasses); see, e.g., the review [9] and the book [?].

3 2-Adic parametrization of the genetic code

In the 2-adic parametrization approach of [11] we enumerate, in a special way, the set of 64 codons (triples of nucleotides) by pairs of digits $(x, y)$, $x, y = 0, 1, 2, \ldots, 7$. These digits are in one-to-one correspondence with the triples $(x_0, x_1, x_2)$ and $(y_0, y_1, y_2)$ of 0 and 1 (the expansions of $x$ and $y$ in powers of 2):

$$x = x_0 + 2x_1 + 4x_2, \qquad y = y_0 + 2y_1 + 4y_2, \qquad x_i, y_i = 0, 1.$$

The 2-adic norm of $x$ is equal to $2^{-i}$, where $i$ is the index of the first non-zero $x_i$ in the above expansion (analogously for $y$).

Each pair of digits $(x_i, y_i)$ is defined by a nucleotide using the rule

A -> 00,   G -> 01,
U -> 10,   C -> 11.


Since the nucleotides A = (0,0) and G = (0,1) are purines, and U = (1,0) and C = (1,1) are pyrimidines, different first digits in the above binary representation correspond to different chemical types of nucleotides. Namely, a nucleotide $(x, y)$ with $x = 0$ is a purine, and a nucleotide $(x, y)$ with $x = 1$ is a pyrimidine.

The second digit $y = 0, 1$ in this parametrization describes the H-bonding character (weak for $y = 0$ and strong for $y = 1$).

Using the above correspondence between the nucleotides and the digits 0 and 1, we introduce the correspondence between the codons (triples of nucleotides) and the triples $(x_0, x_1, x_2)$ and $(y_0, y_1, y_2)$ of 0 and 1 (equivalently, the corresponding $x, y = 0, 1, 2, \ldots, 7$) by the following prescription.

The second nucleotide in the codon defines the pair $(x_0, y_0)$, the first nucleotide in the codon defines the pair $(x_1, y_1)$, and the third nucleotide in the codon defines the pair $(x_2, y_2)$.

This rule is related to the following hierarchy of nucleotides in the codon 

2 > 1 > 3 

i.e., the second nucleotide in the codon is the most important (and the largest in the 2-adic norm) and the third nucleotide is the least important.

After that we make a special reshuffling of the values of $x$ and $y$:

$$0, 4, 2, 6, 1, 5, 3, 7 \ \mapsto\ 1, 2, 3, 4, 5, 6, 7, 8,$$

similar to the one made in [17].

Then we enumerate the codons using the rule described above. Namely, we put the codons in an 8 x 8 table with the natural 2-adic norm (here the numbers of the lines and columns are the $x$ and $y$ defined above):

AAA AAG    GAA GAG    AGA AGG    GGA GGG
AAU AAC    GAU GAC    AGU AGC    GGU GGC

UAA UAG    CAA CAG    UGA UGG    CGA CGG
UAU UAC    CAU CAC    UGU UGC    CGU CGC

AUA AUG    GUA GUG    ACA ACG    GCA GCG
AUU AUC    GUU GUC    ACU ACC    GCU GCC

UUA UUG    CUA CUG    UCA UCG    CCA CCG
UUU UUC    CUU CUC    UCU UCC    CCU CCC

The 2 x 2 squares here are 2-dimensional 2-adic balls (or clusters) of diameter 1/4.
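The construction of this table can be reproduced mechanically. The Python sketch below is an illustration under the rules stated above (the nucleotide digits, the hierarchy 2 > 1 > 3, and the reshuffling 0, 4, 2, 6, 1, 5, 3, 7 to 1, ..., 8); the function and variable names are hypothetical.

```python
# Nucleotide -> (x_i, y_i): A = (0,0), G = (0,1), U = (1,0), C = (1,1).
BITS = {"A": (0, 0), "G": (0, 1), "U": (1, 0), "C": (1, 1)}

# Reshuffling of the values 0..7 into line/column numbers 1..8.
RESHUFFLE = {0: 1, 4: 2, 2: 3, 6: 4, 1: 5, 5: 6, 3: 7, 7: 8}

def codon_to_xy(codon):
    """Digits (x, y) in 0..7 of a codon such as 'AUG'."""
    first, second, third = codon
    # Hierarchy 2 > 1 > 3: the second nucleotide gives digit 0,
    # the first gives digit 1, the third gives digit 2.
    (x0, y0), (x1, y1), (x2, y2) = BITS[second], BITS[first], BITS[third]
    return x0 + 2 * x1 + 4 * x2, y0 + 2 * y1 + 4 * y2

def codon_to_cell(codon):
    """(line, column) of the codon in the 8 x 8 table, both in 1..8."""
    x, y = codon_to_xy(codon)
    return RESHUFFLE[x], RESHUFFLE[y]

def build_table():
    """8 x 8 table of codons indexed by (line - 1, column - 1)."""
    table = [[None] * 8 for _ in range(8)]
    for a in "AGUC":
        for b in "AGUC":
            for c in "AGUC":
                line, column = codon_to_cell(a + b + c)
                table[line - 1][column - 1] = a + b + c
    return table

for row in build_table():
    print(" ".join(row))
```

The reshuffling is a reversal of the three binary digits, so that codons sharing the second nucleotide, and then the first, end up in adjacent lines and columns.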

After application of the eucaryotic genetic (amino acid) code (described in the Appendix) to the above table we get the table of amino acids on the 2-adic plane

K        E    R        G
N        D    S        G

Ter      Q    Ter|W    R
Y        H    C        R

I|M      V    T        A
I        V    T        A

L        L    S        P
F        L    S        P




where Ter is the stop codon. Each square in the above table corresponds to a 2 by 2 square in the 2-adic plane of codons. For example, the genetic code map acts as follows on the first 2-adic ball:

AAA AAG  ->  K
AAU AAC  ->  N



Here we use the standard notations for the nucleotides A, U, G, C and the amino acids. 

We see that the degeneracy of the genetic code in the above 2-adic parametrization is described by 2-adic proximity: codons which encode the same amino acid are 2-adically close. Moreover, the domains with different degeneracy are arranged symmetrically on the 2-adic plane: the lower right half of the plane is occupied by amino acids with degeneracy four, and the upper left half of the plane contains amino acids with degeneracy mainly equal to two. There are also five cases of additional degeneracy which are not described by the 2-adic parametrization.

The 2-adic 2-dimensional parametrization of the genetic code described here is related to the physical-chemical properties of the amino acids. Namely, the hydrophobic amino acids are clustered in the following two 2-adic balls in the 2-adic plane (all other cells are left empty):

.        .    .        .
.        .    .        .

.        .    Ter|W    .
.        .    C        .

I|M      V    .        .
I        V    .        .

L        L    .        .
F        L    .        .


Here the hydrophobic amino acids are listed according to the book [25] . 

Using the 2-adic parametrization of the genetic code we can divide the set of amino acids into the groups {K, N, E, D, Y, Q, H}, {R, G, W, C}, {I, M, V, L, F}, {T, A, S, P}. These groups are the images, under the genetic code map, of the four quadrants (2-adic balls of diameter 1/2) of the 2-adic plane of codons.
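The quadrant images and the local constancy of the genetic code can be checked with a few lines of Python. The sketch below is an illustration, not the authors' code: it encodes the standard genetic code as a 64-character string in the base order U, C, A, G (with `*` standing for Ter), and the helper names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' = Ter.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def quadrant_image(second):
    """Amino acids encoded by codons with the given second nucleotide,
    i.e. the image of one quadrant (ball of diameter 1/2)."""
    return {aa for codon, aa in CODE.items() if codon[1] == second}

def constant_blocks():
    """Pairs (first, second nucleotide) whose 2 x 2 block, a ball of
    diameter 1/4, is mapped to a single amino acid."""
    return sorted(
        a + b for a, b in product(BASES, repeat=2)
        if len({CODE[a + b + c] for c in BASES}) == 1
    )

for second in "AGUC":
    print(second, "".join(sorted(quadrant_image(second))))
print(constant_blocks())
```

In this computation S appears in two quadrants, because it is encoded both by UC codons and by AGU, AGC; the grouping in the text lists S once, in the group {T, A, S, P}. Eight of the sixteen balls of diameter 1/4 are mapped to a single amino acid, which is the degeneracy four mentioned above.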

4 The PAM matrix 

The following table describes the PAM250 matrix: 



[Table: the PAM250 matrix, given as a triangular array of integer scores indexed by the 20 amino acids; the entries are not recoverable.]

Because the matrix $A$ is symmetric, we show only half of the matrix elements. This matrix looks irregular and has no obvious structure. In the next Section we will show that a reshuffling of the numbers of the lines and columns puts this matrix into a more regular form.

5 Reshuffling of matrix elements of the PAM matrix 

Let us enumerate the lines and the columns of the PAM matrix A of the previous Section using the 2-adic 
parametrization of the genetic code. We get for A the following: 



[Table: the PAM250 matrix with lines and columns in the order K, N, E, D, Y, Q, H, R, G, W, C, I, M, V, L, F, T, A, S, P; the entries are not recoverable.]

Here we have divided the set of amino acids into the groups {K, N, E, D, Y, Q, H}, {R, G, W, C}, {I, M, V, L, F}, {T, A, S, P}. These four groups correspond to the second nucleotide in the codons which encode (through the amino acid genetic code) the amino acids in the corresponding group. Namely, the first group corresponds to codons with A (Adenine) at the second position, the second group to codons with G (Guanine), the third group to codons with U (Uracil), and the fourth group to codons with C (Cytosine).

Compared to the PAM matrix from the previous Section, this reshuffled matrix is more regular. It has large positive matrix elements on the main diagonal, and the off-diagonal terms are close to block constant in at least some of the 16 blocks, each corresponding to matrix elements with indices from one of the 4 groups of amino acids.

In particular, in the block corresponding to the matrix elements $A_{ij}$, $i, j \in \{T, A, S, P\}$, all off-diagonal matrix elements are equal to 1 (10 matrix elements) or 0 (2 matrix elements).

In the block $A_{ij}$, $i \in \{K, N, E, D, Y, Q, H\}$, $j \in \{T, A, S, P\}$, we have matrix elements equal to 0 (13 matrix elements), $-1$ (10 matrix elements), 1 (1 matrix element), and anomalous matrix elements equal to $-3$ and $-5$ corresponding to the amino acid Y. The situation is analogous in the other blocks of the matrix $A$.

We arrive at the following picture: the PAM matrix $A$ is a block matrix with matrix elements close to locally constant (i.e., constant in the 16 blocks) if we exclude the matrix elements corresponding to some amino acids, namely Y, W, C, L, F, and R.



6 Expansion for the PAM matrix 

In the present Section we introduce the main construction of this paper: we expand the PAM matrix $A$ into the sum of the matrices $A^{(2)}$ and $A^{(\infty)}$:

$$A = A^{(2)} + A^{(\infty)},$$

where the matrix $A^{(2)}$ is 2-adically regular (its matrix elements are close to locally constant), and the matrix $A^{(\infty)}$ is sparse (i.e., the majority of its matrix elements are equal to zero). We propose the following choice for the matrices $A^{(2)}$ and $A^{(\infty)}$.

Figure 1: tryptophan W, arginine R, cysteine C




Figure 2: phenylalanine F, tyrosine Y

The non-zero matrix elements of $A^{(\infty)}$ are mainly concentrated on the lines and columns corresponding to Y, W, C, L, F. There are also several non-zero matrix elements corresponding to R and some other amino acids.

We see that the non-zero matrix elements of $A^{(\infty)}$ are mainly concentrated on aromatic amino acids, such as Y, F, W, and on C, which contains the SH group. Therefore deviations from 2-adic regularity (i.e., from the block structure of the matrix $A^{(2)}$) can be discussed as related to the geometric properties of the side chains of amino acids (for the aromatic amino acids Y, F, W, and for Arginine R), and to the ability of Cysteine C to create a disulfide bond.

The above pictures show the amino acids W, R, C, F, Y corresponding to non-zero matrix elements of the matrix $A^{(\infty)}$. Let us mention that the amino acids F and Y, for which the corresponding matrix element of $A^{(\infty)}$ is very large (while the majority of the other matrix elements of $A^{(\infty)}$ for these amino acids are negative), are very similar from the point of view of geometry.
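To make the idea of the expansion concrete, here is a small Python sketch. It is not the authors' procedure, since the expansion in the paper is constructed by hand: the rule used below (replace every off-diagonal entry by the lower median of its block and keep the diagonal) is an assumption of this sketch, and the 4 x 4 matrix is made-up toy data, not PAM250 values.

```python
from statistics import median_low

def block_decomposition(A, groups):
    """Split a square matrix A into a block-constant part A2 and a
    remainder A_inf = A - A2. Off-diagonal entries of A2 are the lower
    median of their block; diagonal entries of A are kept in A2."""
    n = len(A)
    group_of = {i: g for g, members in enumerate(groups) for i in members}
    entries = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                entries.setdefault((group_of[i], group_of[j]), []).append(A[i][j])
    level = {block: median_low(values) for block, values in entries.items()}
    A2 = [[A[i][j] if i == j else level[(group_of[i], group_of[j])]
           for j in range(n)] for i in range(n)]
    A_inf = [[A[i][j] - A2[i][j] for j in range(n)] for i in range(n)]
    return A2, A_inf

# Toy symmetric matrix (illustrative values only) with two groups of indices.
TOY = [[5, 1, -2, -2],
       [1, 4, -2, -7],
       [-2, -2, 6, 1],
       [-2, -7, 1, 3]]
GROUPS = [[0, 1], [2, 3]]

A2, A_inf = block_decomposition(TOY, GROUPS)
for row in A_inf:
    print(row)
```

In the toy example a single anomalous pair of entries is moved into the sparse part, while everything else is absorbed by the block-constant part. Other choices of the block level (mean, mode) would give different but similar splittings.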

Of course, our analysis of the PAM matrix based on the 2-adic plane representation of the genetic code is only a first step in using p-adic numbers in genetics and in bioinformatics in general. We hope to proceed towards other important problems, cf., e.g., [?], [1], [?]. Finally, we mention the famous "In Silico Biology" project; see, e.g., Yamato et al. [27]. We point out that, in fact, the operation of any computer can be represented as a 2-adic dynamical system; see [10], [28]. Therefore the 2-adic representation of the genetic code might be useful in the realization of the "In Silico Biology" project.

7 Appendix: the PAM matrix 

In this section we discuss the construction of the Dayhoff PAM matrix, which can be found, for example, in [1], [2]. We start with blocks, i.e., ungapped multiple alignments of proteins from existing databases. Any sequence in a block is no more than 15% different from any other sequence in that block.

Then a Markov model which reproduces the mentioned blocks of proteins is constructed. This Markov model is defined by the amino acid substitutions (point mutations). We have the stationary distribution $p_a$ of the probabilities of the amino acids, $\sum_{a=1}^{20} p_a = 1$, and the transition probabilities $p_{ab}$, normalized by the condition that the probability of a point mutation (substitution of an amino acid) at one step of the Markov model is equal to 0.01:

$$\sum_{a, b = 1,\ a \ne b}^{20} p_{ab}\, p_b = 0.01.$$

Then we take the matrix given by $n$ steps of the Markov model, i.e., the $n$-th power $P^n$ of the matrix $P = (p_{ab})$, with matrix elements $p^{(n)}_{ab}$, and consider the matrix with the matrix elements

$$A_{ab}(n) = \log_{10}\left(\frac{p^{(n)}_{ab}}{p_a}\right).$$

This matrix is known as the PAM matrix (usually $n$ is taken to be equal to 250 and the matrix elements are approximated by integers).
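The steps of this construction can be sketched on a toy example. The Python code below is an illustration only: it uses a made-up three-letter alphabet instead of the 20 amino acids, builds a reversible chain from hypothetical frequencies and symmetric exchange weights, and takes the log-odds of the $n$-step transition probability against the stationary frequency. The convention `P[a][b]` = probability of the substitution b -> a and the choice of denominator are assumptions of the sketch.

```python
import math

def matmul(A, B):
    """Product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result

def toy_chain(freq, exch, mutation_prob=0.01):
    """Column-stochastic matrix P, reversible with respect to freq and
    scaled so that the probability of a substitution per step equals
    mutation_prob."""
    size = len(freq)
    rate = [[exch[a][b] * freq[a] if a != b else 0.0 for b in range(size)]
            for a in range(size)]
    total = sum(rate[a][b] * freq[b] for a in range(size) for b in range(size))
    scale = mutation_prob / total
    P = [[rate[a][b] * scale for b in range(size)] for a in range(size)]
    for b in range(size):
        P[b][b] = 1.0 - sum(P[a][b] for a in range(size) if a != b)
    return P

def mutation_probability(P, freq):
    """Sum over a != b of P[a][b] * freq[b]."""
    size = len(freq)
    return sum(P[a][b] * freq[b]
               for a in range(size) for b in range(size) if a != b)

def log_odds(P, freq, n):
    """Matrix of log10(P^n[a][b] / freq[a])."""
    Pn = matpow(P, n)
    size = len(freq)
    return [[math.log10(Pn[a][b] / freq[a]) for b in range(size)]
            for a in range(size)]

FREQ = [0.5, 0.3, 0.2]                  # hypothetical stationary frequencies
EXCH = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # hypothetical symmetric weights
P = toy_chain(FREQ, EXCH)
A = log_odds(P, FREQ, 250)
for row in A:
    print([round(v, 3) for v in row])
```

Because the toy chain is reversible, the resulting log-odds matrix is symmetric, and its diagonal entries are positive, in line with the large positive diagonal of the PAM matrix.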

8 Appendix: the genetic code 

The following table describes the eucaryotic genetic code, i.e., the correspondence between codons (triples of nucleotides) and amino acids:

AAA K      AAU N      AAG K      AAC N
UAA Ter    UAU Y      UAG Ter    UAC Y
GAA E      GAU D      GAG E      GAC D
CAA Q      CAU H      CAG Q      CAC H

AUA I      AUU I      AUG M      AUC I
UUA L      UUU F      UUG L      UUC F
GUA V      GUU V      GUG V      GUC V
CUA L      CUU L      CUG L      CUC L

AGA R      AGU S      AGG R      AGC S
UGA Ter    UGU C      UGG W      UGC C
GGA G      GGU G      GGG G      GGC G
CGA R      CGU R      CGG R      CGC R

ACA T      ACU T      ACG T      ACC T
UCA S      UCU S      UCG S      UCC S
GCA A      GCU A      GCG A      GCC A
CCA P      CCU P      CCG P      CCC P